Cannabis sativa SimpleTidy_GeneCoEx
Initial data overview
We found a total of 130 RNAseq datasets for Cannabis sativa available in NCBI, belonging to 6 different projects, 6 different tissues/organs, and sequenced from both single- and paired-end libraries (with variable read sizes, from 35 to 150).
| BioProject | Number of Samples |
|---|---|
| PRJNA1108719 | 3 |
| PRJNA1126191 | 20 |
| PRJNA1173925 | 12 |
| PRJNA560453 | 64 |
| PRJNA73819 | 8 |
| PRJNA80055 | 23 |
| Tissue | Number of Samples |
|---|---|
| Flower | 50 |
| Leaf | 19 |
| Meristem | 20 |
| Root | 10 |
| Stem | 7 |
| Trichome | 24 |
| Sequencing layout | Number of Samples |
|---|---|
| PE | 104 |
| SE | 26 |
Read trimming and filtering (fastp) was repeated to include the shorter-read samples:
- for samples with original average read length < 40 bases, the minimum size allowed after trimming was 30 bases
- for samples with original average read length > 40 bases, the minimum size allowed after trimming was 50 bases.
Read mapping, counting and TPM computation
Reads were aligned to C. sativa’s female genome (downloaded from Ensembl Plants release 61: genome and GTF annotation) using STAR.
Given the difference in average read lengths between the datasets, the genome indexing step was taking using four different values for the --sjdbOverhang parameter: 30, 70, 90 and 144.
The option --quantMode was added to STAR to obtain read counts. Strandness was infered using RSeQC infer_experiment.py script, alignment BAM files and the genome annotation in BED format.
Subsequently, read counts (for each replicate, according to their strandness) were combined into a count matrix using R 4.4.0 and the tidyverse package.
For TPM computation, the GTF file was used to compute gene lengths (through exon overlap) (using R packages rtracklayer and GenomicRanges).
PCA and gene filtering
Before computing a minimum expression threshold, a principal component analysis was performed to detect potentially aberrant samples (eg. samples of the same bioproject clustering together despite originating from different tissues). For this, genes that were not detect in any sample were removed (overall read count ); then PCA was computed with (log10(TPM+1)).
Since no clear issues were visible, the analysis proceeded with all datasets.
Filtering
After filtering weakly expressed genes, 10617 were removed from the dataset and 16632 were kept (61.04% of all genes).
PCA post gene filtering
No obvious difference was observed in the PCA before and after gene filtering, indicating that most of the genes removed from the dataset had TPM = 0 in all samples (as those genes were removed prior to the initial PCA).
Gene expression variation
To detect clusters of genes co-expressed in different tissues, the genes with most variable expression in relation to the tissues were selected.
For this, TPM levels for each gene were averaged for each tissue, then the relative variance (\(variance/mean\)) of the averaged TPM values was computed. Next, the top 20% genes with highest relative variance of TPM were selected.
Of the 35 bait genes, 15 were included in the network construction.
Gene correlation
With the most variable genes selected, TPM levels were transformed into z-scores (to make the expression levels of different genes comparable), then Pearson correlation score was computed. The higher the correlation score, the more similar the expression of two genes across the different tissues in the dataset.
To filter the correlation table, correlation scores were first binned by rounding to 5 significant digits; for each bin, the total number of edges (TE) and the number of significant edges (SE) found with r scores equal or greater than the current bin were counted, then SE was divided by TE to obtain the cumulative proportion of significant correlations. Finally, the correlation threshold was selected as the lowest positive r score with cumulative proportion of significant correlations > 0.9.
- TE = total number of edges
- SE = number of significant edges
- CPSC = cumulative proportion of significant correlations
- rbin = binned r score values
- r = r score
\[CPSC = \frac{ SE~r \geq rbin~} {TE~r \geq rbin~ }\]
r threshold = lowest positive rbin value with CPSC > 0.9
Correlation threshold = 0.9868
Network construction
The network requires genes and their correlations (a correlation threshold) and a resolution level.
In a network, a cluster is a group of genes correlated to each other. When few genes are correlated to each other in a cluster, a low resolution may allow the distinction of the individual genes or of smaller clusters. Clusters formed by strongly correlated genes (higher correlation scores) require higher resolution to distinguish individual genes or of smaller clusters.
To optimize the resolution parameter, the function cluster_leiden was repeated using multiple resolution parameters (from 0.25 to 5, with steps of 0.25). Then, clusters with fewer than 5 genes were discarded and the number of genes in each detected cluster and the number of detected clusters was computed. For this step, the resolution selected was the highest one which kept all genes used in the network construction.
With resolution set to 0.5 , the genes of interest were found in clusters 4, 5, 6, 8, 9, 34 and 37 (table 2).